Unicode Canonical Decomposition for Hangeul Syllables in Regular Expression

Authors

  • Hee Yuan Tan
  • Hyotaek Lim
Abstract

Owing to their high expressiveness, regular expressions are frequently used for searching and manipulating text-based data. Regular expressions work well for processing text in the Latin alphabet, but the same cannot be said for Hangeul, the writing system of the Korean language. Although Hangeul has alphabetic features, the expressiveness of regular expression patterns over Hangeul is hindered by the absence of syllable decomposition. Without decomposition support in the regular expression engine, searching Hangeul text is limited to string literal matching. Literal matching makes enumerating syllable candidates in the pattern definition indispensable, yet impractical, especially for a large set of candidates. The existing canonical decomposition in the Unicode standard does reduce a precomposed Hangeul syllable into smaller units of consonant-vowel or consonant-vowel-consonant letters, but it still leaves many of the individual letters in compound form. We have observed that the compound letters must be further reduced into basic letters to represent the Korean script properly in regular expressions. We look at how the new canonical decomposition technique proposed by Kim can help in handling Hangeul in regular expressions. In this paper, we examine several performance indicators of full decomposition of Hangeul syllables to better understand the overhead that might be incurred if full decomposition were implemented in a regular expression engine. For efficiency, we propose a semi decomposition technique together with a notation for defining Hangeul syllables. The semi decomposition enhances the existing regular expression syntax by incorporating some of the special constructs and features of the Korean language. The proposed technique is intended to give end users greater freedom in defining regular expression syntax for Hangeul.

Key words: regular expression, Hangeul, Unicode, NFD, Korean script
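The limitation of standard canonical decomposition described in the abstract can be illustrated with a minimal Python sketch that uses only the standard `unicodedata` and `re` modules. The helper names (`decompose`, `syllables_starting_with_g`) are hypothetical and chosen for illustration. The sketch shows standard Unicode NFD only. It is not the full decomposition proposed by Kim, nor the semi decomposition technique proposed in the paper.

```python
import re
import unicodedata


def decompose(text):
    """Apply standard Unicode canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", text)


# NFD splits a precomposed syllable into initial, vowel and optional final letters.
# For "값" the final letter is a compound of two basic consonants,
# yet NFD leaves it as a single code point (U+11B9).
jamo = decompose("값")
codepoints = [f"U+{ord(ch):04X}" for ch in jamo]
print(codepoints)

# With decomposed text, one pattern can cover every syllable whose initial
# consonant is U+1100, instead of enumerating each precomposed syllable.
# Vowels occupy U+1161..U+1175 and final consonants U+11A8..U+11C2.
INITIAL_G = re.compile("\u1100[\u1161-\u1175][\u11A8-\u11C2]?")


def syllables_starting_with_g(text):
    """Return the syllables in text whose initial consonant is U+1100 (hypothetical helper)."""
    matches = INITIAL_G.findall(decompose(text))
    # Recompose each match so the caller sees ordinary precomposed syllables.
    return [unicodedata.normalize("NFC", m) for m in matches]


print(syllables_starting_with_g("한국 가방 곰"))
```

Because the compound final letter remains a single code point, a pattern written against basic letters alone cannot see inside it, which is the gap that further decomposition is meant to close.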

Similar resources

Coding Partitions of Regular Sets

A coding partition of a set of words partitions this set into classes such that whenever a sequence, of minimal length, has two distinct factorizations, the words of these factorizations belong to the same class. The canonical coding partition is the finest coding partition that partitions the set of words in at most one unambiguous class and other classes that localize the ambiguities in the f...

Canonical decomposition of catenation of factorial languages

According to a previous result by S. V. Avgustinovich and the author, each factorial language admits a unique canonical decomposition to a catenation of factorial languages. In this paper, we analyze the appearance of the canonical decomposition of a catenation of two factorial languages whose canonical decompositions are given.

Canonical Decomposition of Polynomial Ideals

V.Ortiz established in [10] the existence of a canonical decomposition of ideals in a commutative noetherian ring. In this paper we study the canonical decomposition of ideals in a polynomial ring and we give an algorithmic procedure to compute canonical decompositions.

Discovering Regularities in Databases Using Canonical Decomposition of Binary Relations

Regularities in databases are directly useful for knowledge discovery and data summarization. As a mathematical background, relational algebra helped for discovering the main data structures and existing dependencies between the different attributes in a relational database. Functional, difunctional and other kinds of dependencies in a relational database describe invariant regular structures t...

Quasi-isometries preserve the geometric decomposition of Haken manifolds

We prove quasi-isometry invariance of the canonical decomposition for fundamental groups of Haken 3-manifolds with zero Euler characteristic. We show that groups quasi-isometric to Haken manifold groups with nontrivial canonical decomposition are finite extensions of Haken orbifold groups. As a by-product we describe all 2-dimensional quasi-flats in the universal covers of non-geometric Haken m...

Journal title:
  • IEICE Transactions

Volume 94-D

Publication date: 2011